QC report
Introduction
This report contains the QC results for your recent nanologix run. Please check these QC results carefully to ensure that both the sequencing libraries and the nanobody libraries look OK.
Sequencing QC
MultiQC report
Full sequencing run QC is available in the MultiQC report, but not all sections are necessary for our application. In my opinion, these are the most important things we need to look at:
Sequence counts
Ideally, we want ~100,000 reads for round 2, ~500,000 reads for round 1 and ~2,000,000 reads for round 0 to be confident we are close to saturation. These values are indicated on the graph.
What if there are fewer reads?
Honestly, we can probably get away with fewer reads, but of course it’s better to have too many than not enough, especially so that we have some breathing room if there are lots of adapter dimers.
Based on diversity estimates, ~75,000 reads for round 2, ~250,000 reads for round 1 and ~1,000,000 reads for round 0 should be enough in most cases. If we have fewer than this, we might still be OK but should check the sequencing saturation before proceeding with the analysis.
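The two sets of thresholds above can be applied per sample with a small helper. This is only a sketch: the read counts come from the text, while the function name, the category labels and the use of integers 0 to 2 for the selection round are assumptions made for illustration.

```python
# Read-count thresholds per selection round, taken from the text above.
IDEAL_READS = {0: 2_000_000, 1: 500_000, 2: 100_000}
MINIMUM_READS = {0: 1_000_000, 1: 250_000, 2: 75_000}


def classify_read_depth(selection_round, read_count):
    """Label a sample's read count (hypothetical labels, not pipeline output)."""
    if read_count >= IDEAL_READS[selection_round]:
        return "ideal"
    if read_count >= MINIMUM_READS[selection_round]:
        return "probably sufficient"
    # Below the lower threshold: might still be OK, but check saturation first.
    return "check saturation"


print(classify_read_depth(2, 80_000))
```

A round 2 sample with 80,000 reads falls between the two thresholds, so it is labelled as probably sufficient rather than ideal.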
Sequence quality histograms
Ideally the quality should be in the green zone (>Q28) across all bases of the reads. It’s not unusual though for it to dip into the yellow, particularly towards the end of R2.
What if the quality is worse than that?
When using MiSeq (instead of NextSeq*), it’s likely that the quality, particularly in R2, will drop off quite quickly. As part of the pre-processing, reads will be trimmed to a minimum quality of Q20 (99% base call accuracy) before proceeding with downstream analyses. So it only affects our downstream results insofar as we might be left with fewer reads (after trimming off low-quality parts, there might not be enough sequence left to merge the reads).
* If you see quality dipping into the red on NextSeq runs, then something has gone wrong!
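For reference, a Phred quality score Q corresponds to a base call error probability of 10^(-Q/10). The short sketch below converts a score to an accuracy; the function name is only illustrative and is not part of the pipeline.

```python
def phred_to_accuracy(q):
    """Convert a Phred quality score to base call accuracy (0 to 1)."""
    error_probability = 10 ** (-q / 10)
    return 1 - error_probability


# Scores mentioned in this report: the trimming cutoff and the green zone.
for q in (20, 28):
    print(f"Q{q}: {phred_to_accuracy(q):.2%} accurate")
```

Under this definition, Q20 means about 1 error in 100 bases and Q30 about 1 error in 1,000 bases.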
Per sequence quality scores
There should be a strong peak in the green zone (>Q28) for all samples.
What if the peak isn’t in the green zone?
If the peak is in the yellow or red zones, something has gone terribly wrong!
Per sequence GC content
There should be a single peak at ~55% GC content.
What if there are multiple peaks?
Given the homogeneous nature of nanobodies, we expect only one peak. The presence of multiple peaks suggests that either:
The data wasn’t demultiplexed cleanly, so some spike-ins are mixed with the nanobody reads.
There are high levels of adapter dimers (in NextSeq data this commonly results in peaks at very high GC content, as adapter dimer reads have a lot of G’s due to the dark cycle).
Neither of these is a critical issue. They just ‘waste’ some of our sequencing, so we must ensure that we still have enough reads after taking them into consideration.
Adapter dimers
One common problem we can encounter with this protocol is the presence of adapter dimers. These manifest as a low percentage of reads passing through the trimming and merging process. You can check this table to see which samples (if any) had a problem with adapter dimers, and whether they will still have enough reads to meet the thresholds outlined above. You might also want to look at whether any of the various metadata columns seem to correlate in any way with adapter dimer levels.
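As a rough illustration of the check described above, the sketch below computes the fraction of reads that survive trimming and merging, and tests whether the surviving reads still meet the lower read-count thresholds given earlier in this report. The function names and the example counts are hypothetical; no cutoff for a "low" pass rate is defined here because the report does not give one.

```python
# Lower read-count thresholds per selection round, as given earlier in the report.
MINIMUM_READS = {0: 1_000_000, 1: 250_000, 2: 75_000}


def pass_rate(raw_reads, merged_reads):
    """Fraction of raw read pairs that survive trimming and merging."""
    if raw_reads == 0:
        return 0.0
    return merged_reads / raw_reads


def enough_reads_after_merging(selection_round, merged_reads):
    """True if the merged reads still meet the lower threshold for that round."""
    return merged_reads >= MINIMUM_READS[selection_round]


# Hypothetical round 1 sample that lost most of its reads to adapter dimers.
raw, merged = 600_000, 150_000
print(f"{pass_rate(raw, merged):.0%} passed, enough reads: "
      f"{enough_reads_after_merging(1, merged)}")
```

In this made-up example only a quarter of the reads pass, leaving the sample below the round 1 threshold, so its saturation curve would be worth checking.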
Nanobody library QC
As a quick quality control, it is good to check a few basic features of the nanobody libraries (as opposed to the sequencing libraries, which were covered in the last section).
Productivity
We expect the overwhelming majority of reads (>90%) to be productive, as these nanobodies come from immunised alpacas. A high level of unproductive reads might indicate problems in cloning, or sequencing errors.
What makes a read productive?
Reads are classified as productive according to the AIRR specifications. They must have:
V and J gene alignments in frame
no stop codon
no frameshift in the V region (relative to the reference)
From here on, only productive reads are considered in the analysis.
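The three criteria above can be expressed as a simple filter. This is a minimal sketch, assuming each read is available as a dictionary with the boolean fields `vj_in_frame`, `stop_codon` and `v_frameshift` from the AIRR Rearrangement schema; how the actual pipeline stores and evaluates these fields is not described in this report, and the function names are illustrative.

```python
def is_productive(record):
    """Apply the three criteria listed above to one read.

    Missing or null fields are treated as failing, which is a conservative
    assumption made for this sketch.
    """
    return (
        record.get("vj_in_frame") is True
        and record.get("stop_codon") is False
        and record.get("v_frameshift") is False
    )


def productive_fraction(records):
    """Fraction of reads classified as productive (0.0 for an empty input)."""
    records = list(records)
    if not records:
        return 0.0
    return sum(is_productive(r) for r in records) / len(records)


# Hypothetical reads: one productive, one with a stop codon.
reads = [
    {"vj_in_frame": True, "stop_codon": False, "v_frameshift": False},
    {"vj_in_frame": True, "stop_codon": True, "v_frameshift": False},
]
print(productive_fraction(reads))
```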
V, D and J genes
What we’re looking for here is to see a good distribution of the genes across the libraries (a very colourful plot).
What if our plot isn’t so colourful?
If we see certain libraries made up almost entirely of a single V (or D or J) gene, then that is a sign that the library may be highly duplicated. This is not necessarily a bad thing, particularly for round 2 samples, but in general we would like to see it kept to a minimum.
Saturation curves
Ideally, we would like to sequence our nanobody libraries to saturation (see all nanobody sequences that are present). This will allow us to estimate the sequence diversity, as well as to adopt a conservative filtering approach (excluding CDR3s with a count of 1) to reduce the impact of sequencing errors. We can check whether our sequencing is at saturation by plotting a saturation curve: taking random samples of size \(x\) from the library, and counting how many unique CDR3s are identified, \(y\). If sequencing is at saturation, the curve will ‘flatten out’ (come to an asymptote) once we have seen all of the possible unique CDR3s in the library. The \(y\) value of this asymptote represents the diversity of the nanobody library.
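The subsampling procedure described above can be sketched as follows. It assumes the library is available as a list with one CDR3 sequence per read and that sampling is done without replacement; the function name, the seed and the choice of sample sizes are illustrative and are not taken from the pipeline.

```python
import random


def saturation_curve(cdr3s, sample_sizes, seed=0):
    """Return (x, y) points: x reads sampled, y unique CDR3s observed.

    One random subsample is drawn per sample size, so the resulting curve
    can look slightly jagged.
    """
    rng = random.Random(seed)
    curve = []
    for size in sample_sizes:
        # A subsample cannot be larger than the library itself.
        size = min(size, len(cdr3s))
        subsample = rng.sample(cdr3s, size)
        curve.append((size, len(set(subsample))))
    return curve


# Hypothetical toy library: three distinct CDR3s at different abundances.
library = ["CAAGY"] * 50 + ["CNVRW"] * 30 + ["CATDY"] * 20
print(saturation_curve(library, [0, 10, 50, 100]))
```

In this toy library the curve can never rise above 3, the number of distinct CDR3s, which is the asymptote the text refers to.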
What are we looking for in these plots?
We want to see that the curves flatten out! If they don’t, this means that there are not enough reads to sequence that library to saturation. The filtering strategy may need to be adjusted, or you may have trouble identifying rare clones. Note that the curves sometimes look a bit ‘jagged’, but this is nothing to worry about. To save time, we only simulate once to create the plot; it would be smoother if we did multiple simulations and averaged them, but I don’t think it’s necessary.
Estimated library diversity
Using the data from the saturation curves, we can estimate the diversity of our nanobody libraries:
How was this estimated?
The diversity of the library is the asymptote of the saturation curve. Reading the asymptotes by eye is laborious, so as a quick heuristic we find the point at which the first derivative (gradient) of the saturation curve is at its lowest positive value. When the curve reaches its asymptote the gradient will be zero, so this point approximates where the curve flattens out.
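One way to read this heuristic is sketched below. It assumes the saturation curve is a list of (sample size, unique CDR3 count) points sorted by strictly increasing sample size, approximates the gradient by finite differences between neighbouring points, and takes the unique CDR3 count at the end of the segment with the lowest positive gradient as the estimate. These choices, the function name and the fallback for a completely flat curve are assumptions, not a description of the pipeline's exact implementation.

```python
def estimate_diversity(curve):
    """Estimate diversity from (sample_size, unique_cdr3s) points.

    Assumes the points are sorted by strictly increasing sample size.
    """
    best_gradient = None
    best_y = None
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        gradient = (y1 - y0) / (x1 - x0)  # finite-difference gradient
        if gradient > 0 and (best_gradient is None or gradient < best_gradient):
            best_gradient = gradient
            best_y = y1
    if best_y is None:
        # No rising segment at all: fall back to the highest observed count.
        return max(y for _, y in curve)
    return best_y


# Hypothetical curve that is close to flattening out.
points = [(0, 0), (100, 80), (200, 120), (300, 130), (400, 132)]
print(estimate_diversity(points))
```

For a curve that is still rising steeply at its last point, this estimate is only a lower bound on the true diversity, which is consistent with the advice to check that the curves flatten out.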